System and method for real-time detection of anomalies in database usage

ABSTRACT

A system and method for real-time detection of anomalies in database or application usage is disclosed. Embodiments provide a mechanism to detect anomalies in database or application usage, such as data exfiltration attempts, first by identifying correlations (e.g., patterns of normalcy) in events across different heterogeneous data streams (such as those associated with ordinary, authorized and benign database usage, workstation usage, user behavior or application usage) and second by identifying deviations/anomalies from these patterns of normalcy across data streams in real-time as data is being accessed. An alert is issued upon detection of an anomaly, wherein a type of alert is determined based on a characteristic of the detected anomaly.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of U.S. patent application Ser. No. 14/732,162 filed on Jun. 5, 2015 which claims priority of U.S. Provisional Application Ser. No. 62/009,736, filed on Jun. 9, 2014, which is hereby incorporated herein by reference in its entirety.

FIELD

Embodiments are in the technical field of database or application usage. More particularly, embodiments disclosed herein relate to systems and methods for real-time detection of anomalies in database or application usage which, inter alia, foster discovery of patterns of behavior and anomalies from across a plurality of heterogeneous data streams in real-time in order to detect anomalies that could not have been detected by monitoring any single data stream alone.

BACKGROUND

Significant damage, both reported and unreported, has been caused to enterprises, government agencies, and national security through “insider threat” attacks, especially data exfiltration. For example, consider the recent retrieval and release of secret and top-secret information from the defense and intelligence communities by Manning and Snowden. Data exfiltration attacks may include, e.g., intentional user activity (e.g., a user impermissibly downloads sensitive data and removes the data from the enterprise) or automated activity (e.g., malicious software operates on behalf of or as a user with or without the user's knowledge). Unfortunately, the current state of the art for addressing such problems is quite limited.

1. Monitoring of all System Data: Due to the large volume of data that is required to monitor ALL users and systems in an enterprise, this data (often only portions of the required data) is usually stored and analyzed off-line in a database or data warehouse. Unfortunately, this analysis only reveals issues after the fact, i.e., after any actual data exfiltration attempts have occurred and at such a point in time when it may be too late to take any action or to prevent damage from the exfiltration. In a best case, the offending user or system is still there and may be expected to perform such actions again, where they can be targeted for further analysis and prosecution. However, many times, the offending user or system is no longer present, the damage has already been done or it is otherwise too late to take effective action.

2. Real-time Monitoring of Suspected Individuals: If an individual is suspected of malicious activity, then real-time monitoring mechanisms can be configured and installed to directly monitor that individual's activity and detect any malicious activity. These mechanisms, however, are time consuming to install and also require dedicated analysts to conduct the real-time monitoring and detection, often at great expense to the affected enterprise.

As noted above, there are significant problems with the above approaches. The above approaches are slow to react, may not catch a user in time, and have a hard time detecting malicious software that is installed and operating on behalf of a user without the user's knowledge or authorization. Furthermore, the above approaches to addressing data exfiltration problems are expensive to deploy and consume human analyst time and resources.

The insider threat remains one of the most significant problems confronting enterprises and government agencies of all sizes today. The threat is multi-faceted with a high degree of variability in the perpetrator, the type of attack, the intent of the attack, and the access means. No solution today adequately addresses the detection of insider threats due to the highly variable nature of the problem.

No existing system or solution takes user, database, application, and network activity all into account at the same time while using event processing techniques to discover patterns of behavior and anomalies from across this plurality of data streams in real-time in order to detect anomalies that could not have been detected by monitoring any single data stream alone.

Thus, it is desirable to provide a system and method for real-time detection of anomalies in database or application usage which are able to overcome the above disadvantages.

SUMMARY

Embodiments are directed to a method for real-time detection of anomalies. The method comprises: receiving a plurality of heterogeneous data streams, wherein the heterogeneous data streams are received from at least two of a group consisting of agents located at databases, agents located at applications, audit programs located at user workstations, and sensors located in, or at access points to, a network; correlating the heterogeneous data streams, wherein the correlation identifies corresponding events in different ones of the heterogeneous data streams; identifying patterns of events across the correlated heterogeneous data streams; building a model of normalcy from the identified pattern of events, wherein the model of normalcy is stored in an analysis database; creating rules that determine how and whether anomalies are detected, how a detected anomaly is treated and characterized, and what reaction to employ upon detection of the anomaly; receiving a plurality of additional heterogeneous data streams from the at least two of a group consisting of the agents, audit programs, and sensors; applying, using an analysis engine, the model of normalcy and rules to the additional heterogeneous data streams and analyzing data from the additional heterogeneous data streams against the model of normalcy and rules; detecting an anomaly in real-time by determining whether an anomalous event is present, by the application of the rules and whether events, in relation to other events within the additional heterogeneous data streams, fit or do not fit the model of normalcy; determining at least one characteristic of the detected anomaly; and issuing an alert upon detection of the anomaly, wherein a type of alert is determined based on the at least one determined characteristic of the detected anomaly. The detected anomaly is indicative of unauthorized manipulation or falsification of data, sabotage of a database, or exfiltration of data.

In an embodiment, the correlating is performed using a complex event processor for correlating the heterogeneous data streams and integrating the heterogeneous data streams into a single integrated data stream. The correlating may include synchronizing the time of each of the heterogeneous data streams in order to correlate events across time. The correlating may be performed by application of a small-space algorithm (SSA). In the applying step, the analysis engine may use the complex event processor for applying the model of normalcy and rules to the additional heterogeneous data streams and for analyzing data from the additional heterogeneous data streams against the model of normalcy and rules.

In an embodiment, the heterogeneous data streams and/or the additional heterogeneous data streams may be processed using an automatic event tabulator and correlator (AETAC) algorithm to reduce event data complexity and facilitate search, retrieval, and correlation of the event data to thereby produce uniform event data whereby complex event processing of the heterogeneous data streams and/or the additional heterogeneous data streams by the complex event processor is simplified.

In an embodiment, the identified pattern of events may be associated with ordinary, authorized, and benign database usage, workstation usage, user behavior or application usage.

In an embodiment, the heterogeneous data streams may be multi-modal asynchronous signals.

In an embodiment, one of the heterogeneous data streams corresponds to an ordinary, authorized, and benign database query and another of the heterogeneous data streams corresponds to an ordinary, authorized, and benign user interaction at the user workstation, and wherein the identified pattern of events is the ordinary, authorized, and benign database query and the ordinary, authorized, and benign user interaction at the user workstation. The detecting step may detect the anomaly when an event within the additional heterogeneous data streams corresponding to a database query is not preceded by a user interaction at the user workstation within a predetermined period of time resulting in the event not fitting the model of normalcy.

In an embodiment, the alert may comprise at least one of a group consisting of an alarm message, a communication triggering further analysis and/or action, a command instructing the restriction or shutting down of an affected workstation, database, network or network access, initiation of additional targeted monitoring, analysis, and/or applications to capture additional detailed information regarding an attack, continued monitoring of a user, placement of a flag in a file for further follow-up, restricting access to a network, alerting security, and restricting or locking down a building or a portion of a building.

In an embodiment, in the receiving a plurality of heterogeneous data streams step, the heterogeneous data streams may be received by the analysis engine.

In an embodiment, the databases, applications, workstations, or networks may be in an enterprise environment.

Embodiments are also directed to a system for real-time detection of anomalies. The system includes one or more computers, each computer including a processor and memory, wherein the memory includes instructions that are executed by the processor for performing the above-mentioned method.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description will refer to the following drawings, wherein like numerals refer to like elements, and wherein:

FIG. 1 is a block diagram illustrating an embodiment of a system for real-time detection of anomalies in database or application usage.

FIG. 2 is a block diagram illustrating an example of multiple data stream event correlation and anomaly detection.

FIGS. 3A and 3B are flowcharts illustrating an embodiment of a method for real-time detection of anomalies in database or application usage.

FIG. 4 is a block diagram illustrating exemplary hardware for implementing an embodiment of a system and method for real-time detection of anomalies in database or application usage.

FIG. 5 is a graph illustrating space compression (SC) versus epsilon.

FIG. 6 is a graph illustrating false negative rate (FNR) versus epsilon.

DETAILED DESCRIPTION

It is to be understood that the figures and descriptions of the present invention may have been simplified to illustrate elements that are relevant for a clear understanding of the present invention, while eliminating, for purposes of clarity, other elements found in a typical database system or typical method of using a database. Those of ordinary skill in the art will recognize that other elements may be desirable and/or required in order to implement the present invention. However, because such elements are well known in the art, and because they do not facilitate a better understanding of the present invention, a discussion of such elements is not provided herein. It is also to be understood that the drawings included herewith only provide diagrammatic representations of the presently preferred structures of the present invention and that structures falling within the scope of the present invention may include structures different than those shown in the drawings. Reference will now be made to the drawings wherein like structures are provided with like reference designations.

Described herein are embodiments of a system and method for real-time detection of anomalies in database or application usage. Embodiments overcome the problems described above and improve the state of the art by providing an automated mechanism to continuously monitor ALL users and systems for abnormalities, and automatically alerting on violations and deviations from expected behaviors as they are occurring in real time without incurring heavy overhead expenses in human time and labor. Embodiments provide a system and method that takes user, database, application, and network activity all into account at the same time while using event processing techniques to discover patterns of behavior and anomalies from across these data streams in real-time in order to detect anomalies that could not have been detected by monitoring any single data stream.

Embodiments provide a mechanism to detect anomalies in database access and usage, such as data exfiltration attempts, first by identifying correlations (e.g., patterns or models of normalcy) in events across different relevant yet heterogeneous data streams (such as those associated with ordinary, authorized and benign database usage, workstation usage, user behavior or application usage) and second by identifying deviations from these patterns or models of normalcy across data streams in real time as data is being accessed.

Embodiments identify and alert in real-time (i.e., as events are occurring, not after the fact) on insider threat attacks targeting databases and systems using databases, such as file storage and sharing systems. Such threats may involve unauthorized manipulation and falsification of data, sabotage of databases, and exfiltration of data. Data exfiltration refers to users (or agents and/or software systems acting on behalf of users, possibly unknown to the user) illicitly accessing, retrieving, and downloading data that is confidential and proprietary to an enterprise, often with the malicious intent of distributing the data outside the enterprise for personal gain or simply to the detriment of the enterprise.

Embodiments of a system and method for real-time detection of anomalies in database or application usage may include a mechanism and processes that provide and analyze, in real-time, a variety of heterogeneous streams within the enterprise comprising an amalgam of relevant events pertaining to data access, user behavior, and computer and network activity. Taken together, these event streams can identify anomalous system behavior that is indicative of insider threats and data exfiltration. Embodiments can identify anomalous system behavior that cannot be identified from analyzing any single event stream on its own.

Such relevant enterprise event streams that may be monitored and analyzed by embodiments described herein include, but are not limited to:

Data Access

-   -   File-system access
    -   Meta-data concerning Database I/O
    -   Content of database queries (such as SQL queries) and responses

User Behavior

-   -   User logon/logoff events
    -   User keyboard/mouse events, including system commands and application interaction
    -   Email activity (SMTP)
    -   Web activity

Computer Activity

-   -   Process activity
    -   Local I/O activity (e.g., USB ports)

Network Activity

-   -   Packet capture (PCAP) data from user machine
    -   PCAP data from servers

Using complex event processing (CEP) and unsupervised or semi-supervised machine learning techniques, embodiments of the system and method develop models of normalcy correlating the events derived from the selected enterprise streams (e.g., the above-identified streams) based on typical (authorized and benign) behavior of the users, computers, databases, applications, and networks. These models are then used by embodiments of the system and method to identify anomalies in the event streams that may be predictive or indicative of insider threat attacks, including unauthorized data manipulation, falsification, sabotage and data exfiltration.

A key inventive feature of embodiments described herein is the application of automated machine learning concepts for correlating events and detecting anomalies across heterogeneous data streams to the specific data streams relating to database, user, application, computer, and network activity in order to detect insider threats and data exfiltration attempts, taking all the available information into account as the activity is happening in real time. Monitoring of any individual stream alone will not detect all anomalous events. Existing technologies do not provide these advantages. For example:

1. Traditional insider threat detection systems such as Raytheon's InnerView and SureView™ record and store user and computer activity for off-line static analysis.

2. Database monitoring: Stand-alone systems such as IBM® InfoSphere Guardium monitor and analyze database activity. However, such systems do not take other user, application, or network activity into account.

3. User monitoring: Stand-alone systems such as Centrify® DirectAudit monitor and analyze user activity in real time on a workstation or desktop. However, such systems do not take other application, database, or network activity into account.

4. Network activity: Stand-alone systems such as SNORT monitor and analyze network activity in an enterprise network. Such systems do not take user, application, or database activity into account.

No system takes user, database, application, and network activity all into account at the same time while using event processing techniques to discover patterns of behavior and anomalies from across these data streams in real-time in order to detect anomalies that could not have been detected by monitoring any single data stream.

With reference now to FIG. 1, shown is a block diagram illustrating components of an embodiment of a system 100 for real-time detection of anomalies in database or application usage. System 100 may include an analysis engine 102 and an analysis database 104. Analysis engine 102 may receive data streams, such as the data streams described above. For example, analysis engine 102 may receive data streams indicative of user, database or other data access, application, computer and network behavior and activity. Such data streams may be generated and received from various agents, sensors and audit programs located at workstations, in networks or at network access points, data storage (e.g., database) locations, and data processing (e.g., application) locations.

For example, system 100 may include monitoring agents (A), such as IBM® Guardium agents, located at (operating on) databases 106, agents (A) located at applications (e.g., SharePoint) 107, direct audit programs (DA), such as Centrify® DirectAudit, located at (operating on) user workstations 108, and network sensors (NS), such as OpenNMS, located in, or at access points to, a network(s) 110. Each type of agent, program or sensor may produce different types of data streams. For example, IBM® Guardium agents may generate data streams indicative of database interaction on a database server including a timestamp, client machine IP, database user ID, database server IP, and the database query (SQL query). Centrify® DirectAudit may generate a data stream indicative of user interaction on a user machine/workstation including a timestamp, machine user ID and user commands (e.g., as typed into TTY/Shell). The output of some agents, programs and sensors may be modified to work with embodiments described herein. For example, some agents, programs and sensors produce GUI output. Scripts and other mechanisms to extract relevant data and output to, e.g., syslogger, may be used. Analysis engine 102 may include a complex event processor (CEP) that correlates the multiple data streams and integrates such data streams into an integrated data stream. Such correlation may include synchronizing the time of each data stream in order to correlate events across time. The streams of data may include raw, meta and derived data. The CEP platform may ingest multi-modal asynchronous signals from heterogeneous sources. The CEP platform of analysis engine 102 may then apply the models of normalcy to the integrated data stream.
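The time synchronization and integration of streams described above can be illustrated with a minimal sketch. It is not the CEP platform itself: the `Event` record, the stream labels, the payload keys and the `clock_offset` value are all hypothetical, and each input stream is assumed to be already ordered by its own timestamps.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds; comparable across streams once synchronized
    source: str       # hypothetical stream label, e.g. "db_agent"
    payload: dict     # stream-specific fields (hypothetical keys)


def synchronize(events: Iterable[Event], clock_offset: float) -> Iterator[Event]:
    """Shift one stream's timestamps by a clock offset assumed to be known."""
    for e in events:
        yield e._replace(timestamp=e.timestamp + clock_offset)


def integrate(*streams: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge time-ordered streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda e: e.timestamp)


# Hypothetical sample data: a database agent stream and a workstation audit
# stream whose clock is assumed to run 1.5 seconds behind.
db = [Event(10.0, "db_agent", {"query": "SELECT 1"}),
      Event(20.0, "db_agent", {"query": "SELECT 2"})]
ws = [Event(7.5, "workstation_audit", {"command": "psql"}),
      Event(13.0, "workstation_audit", {"command": "ls"})]

merged = list(integrate(db, synchronize(ws, clock_offset=1.5)))
```

Because `heapq.merge` is lazy, the same pattern would apply to unbounded streams, provided each source delivers its events in timestamp order.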

With continuing reference to FIG. 1, analysis engine 102 may receive and process such data streams to (a) determine models or patterns of normalcy and (b) analyze real-time behavior and activity against such models or patterns of normalcy in order to detect anomalies. In analyzing real-time behavior and activity, analysis engine 102 detects events in the data streams, compares the events, or more particularly, the patterns of events in the data streams, against the models of normalcy per rules that are based on such models and designed to enable the analysis engine 102 to determine when a variance from the model of normalcy is indicative of an anomaly, and, when an anomaly is detected as a result of such comparison and application of such rules, issues an alert. An alert may include an alarm message(s), communications to relevant personnel triggering further analysis and/or action, commands instructing the shutting down of the affected workstation, database, network or network access, or starting of more targeted monitoring and analysis systems, applications and/or other efforts to capture more detailed information regarding an attack. Results of detection analysis may be stored in analysis database 104.
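As a hedged illustration of how a type of alert might be chosen from a characteristic of a detected anomaly, the sketch below maps a small anomaly record to one of the alert kinds listed above. The `severity` scale, the `repeated` flag and the thresholds are assumptions made for the example; the description does not prescribe any particular rule set.

```python
from dataclasses import dataclass
from enum import Enum


class Alert(Enum):
    FLAG_FOR_FOLLOW_UP = "place flag in file for follow-up"
    NOTIFY_PERSONNEL = "notify relevant personnel"
    TARGETED_MONITORING = "start targeted monitoring"
    SHUT_DOWN_ACCESS = "shut down affected access"


@dataclass(frozen=True)
class Anomaly:
    stream: str     # stream in which the deviation surfaced
    severity: int   # hypothetical score from 1 (low) to 10 (high)
    repeated: bool  # whether the same deviation has been seen before


def select_alert(anomaly: Anomaly) -> Alert:
    """Pick an alert type from anomaly characteristics (illustrative rules only)."""
    if anomaly.severity >= 8:
        return Alert.SHUT_DOWN_ACCESS
    if anomaly.repeated:
        return Alert.TARGETED_MONITORING
    if anomaly.severity >= 4:
        return Alert.NOTIFY_PERSONNEL
    return Alert.FLAG_FOR_FOLLOW_UP


example = select_alert(Anomaly(stream="db_agent", severity=5, repeated=False))
```

Keeping the rules in a single function of the anomaly record makes it straightforward to store them alongside the models of normalcy and to revise them as the models are updated.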

To build the models of normalcy, analysis engine 102 receives data streams that result from controlled, known typical (authorized and benign) behavior of the users, computers, databases, applications, and networks, analyzes such data streams to determine the pattern of events resulting from such typical behavior and builds a model of the patterns of events occurring during such typical behavior. Analysis engine 102 may generate the aforementioned rules based on the models of normalcy built from such patterns. Ordinarily, the greater the amount of such typical behavior that is analyzed and used to build models of normalcy, the larger the number of typical patterns of events that may be recognized and incorporated into the models. Likewise, greater refinement in the models may be achieved, for example by building models for subsets of typical behavior (e.g., typical behavior under normal operating conditions, typical behavior under emergency operating conditions, typical behavior under off-hours operating conditions, etc.). In other words, embodiments of system 100 may build different models of normalcy that are applicable to different operating conditions and which are applied by analysis engine 102 according to the prevalent operating condition. The models of normalcy and rules may be stored in analysis database 104.

To summarize, embodiments may identify correlations or patterns of behavior through a variety of monitoring agents, programs and sensors and then identify anomalies by detecting deviations from the patterns in real-time.

Embodiments of system 100 may develop the models of normalcy and rules through machine learning techniques applied by the CEP platform. Such machine learning techniques may be unsupervised or semi-supervised. As system 100 operates, analysis engine 102 may continue to apply machine learning techniques to further update the models and rules.

Experiments demonstrate that anomalies often cannot be detected by looking at a single event. Rather, analyzing a pattern or sequence of events and comparing it to other patterns or sequences of events can detect anomalies not detectable by looking at a single event. For example, a database query by itself would not appear to be anomalous. However, when most database queries are preceded by a user interaction (e.g., input into a keyboard), a database query that is not so preceded would likely be anomalous. To illustrate, consider the following scenario:

A user makes regular (authorized) queries to a database.

-   -   Keyboard is used to enter queries

An Advanced Persistent Threat (APT) malware is installed on the user's machine.

The APT makes an (unauthorized) query to the database.

-   -   Keyboard is NOT used

Detecting the threat may include the analysis engine 102 recognizing that:

Pattern: Database query is preceded by user interaction

Anomaly: Database query is NOT preceded by user interaction

Such pattern recognition often requires a correlation between the data streams received from different agents, programs and sensors. With reference now to FIG. 2, shown is an illustration of data streams received from a user X workstation audit program, e.g., Centrify® DirectAudit, and a database monitoring agent, e.g., IBM® Guardium. As shown, the audit program stream shows the keyboard interactions of user X. The database monitoring agent stream may include numerous data queries, some from the user X workstation and others from other workstations, etc. The CEP platform of analysis engine 102 may correlate the keyboard interactions from the user X audit data stream with the data queries shown by the agent data stream. In so doing, analysis engine 102 may determine that one of the data queries from user X is not correlated with/preceded by a keyboard interaction on the user X machine. If the applicable model of normalcy indicates that typical data queries are always preceded by/correlated with a keyboard interaction, analysis engine 102 may characterize the uncorrelated data query as an anomaly (and, therefore, issue an alert).

The correlation between events in different data streams may be quite granular. For example, a model or pattern of normalcy may dictate that a certain event from one data stream is always preceded within, e.g., five (5) seconds, by a certain event from a second data stream. Likewise, the model of normalcy may dictate that the event from one data stream is always preceded by one or more of a variety of events from a second data stream. Embodiments may apply a time window to detect correlations. Embodiments may increase or decrease the time window used in order to increase or decrease the potential number of correlations.
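The time-window rule described above (a database query should be preceded by a user interaction within, e.g., five seconds) can be sketched as follows. This is a minimal batch illustration under stated assumptions: timestamps are plain numbers in seconds from already synchronized streams, and the function name and sample values are hypothetical. A streaming implementation would keep only the most recent interaction per workstation.

```python
from bisect import bisect_left, bisect_right


def unpreceded_queries(query_times, interaction_times, window=5.0):
    """Return query timestamps with no user interaction in [t - window, t]."""
    interactions = sorted(interaction_times)
    anomalies = []
    for t in query_times:
        # Count the interactions that fall inside the closed window before t.
        lo = bisect_left(interactions, t - window)
        hi = bisect_right(interactions, t)
        if hi == lo:  # no interaction precedes this query within the window
            anomalies.append(t)
    return anomalies


# Hypothetical sample: keyboard events at 100 s and 130 s, queries at three times.
keyboard = [100.0, 130.0]
queries = [102.0, 131.5, 160.0]

flagged = unpreceded_queries(queries, keyboard, window=5.0)
flagged_narrow = unpreceded_queries(queries, keyboard, window=1.0)
```

With the five-second window only the query at 160.0 lacks a preceding interaction. Narrowing the window to one second leaves all three queries uncorrelated, which shows how the window size changes the number of correlations found.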

Embodiments of analysis engine 102 may apply a small-space algorithm (SSA) to process the data streams and correlate events. SSA is a new form of stream processing over distributed massive streams. SSA estimates frequently occurring items on a logarithmic space scale (tractable) and permits online extraction of persistent objects in a streaming network. SSA was developed by Prof. Srikanta Tirthapura at Iowa State University. SSA identifies persistent events in data streams. For the purposes of this description, an event is time-stamped data. A persistent event is time-stamped data that appears regularly over time. Characteristics of data streams typically require that all algorithms operate on data in a single pass. Events may be sparse, occur only infrequently, or even appear in different distributed streams. Embodiments may use statistical data sampling to reduce the size of a stream without overlooking persistent events. SSA determines associations between events in the data stream. Such associations may be temporal, spatial or generalized associations over other metrics. SSA makes use of “Frequency Moments” that estimate the total number of objects in a stream without having to search the entire stream. Importantly, SSA can learn the persistent events in a data stream without any prior knowledge and without having to track all of the events in the data stream. Implementations of SSA perform association rule mining to exploit prior and collateral domain knowledge to increase the selectivity of event persistence detection. This decreases false negative errors and increases the ability to detect more transient events.

Embodiments identify anomalous behavior by finding associations (or correlations) between events occurring in different (distributed, heterogeneous) data streams. Embodiments identify “patterns of normalcy” and monitor events for disruption from these patterns. Data for making the predictions may come from sensor networks (e.g., agents, audit programs, network and other sensors) generating heterogeneous streams of observations (‘event’ loosely defined here as ‘time-stamped data’). A challenge is to detect and recognize, from sensor samples, precursor events for “hidden” spatiotemporal processes.

The following describes metrics of SSA that are applicable to embodiments of the system and method for real-time detection of anomalies in database or application usage described herein. Different algorithms identify those events in a stream occurring with a given frequency value α (alpha). A naïve algorithm identifies all α-persistent events given unlimited resources (memory). The SSA algorithm identifies most of the persistent events at an accuracy rate determined by a given ε value (epsilon). Space compression and false negative rate both increase with the chosen epsilon value, as shown below and in FIGS. 5 and 6.

A “naïve” algorithm is a baseline or obvious way to perform a task, contrasted with the present embodiment's algorithm, which improves, optimizes, or otherwise enhances the naïve algorithm. The naïve algorithm for SSA is merely to sample and count every packet in the stream. But these streams tend to be too large to be exhaustively sampled on current computing platforms, so SSA proposes a method involving subsampling to estimate the total counts, optimizing storage space at the expense of accuracy. The epsilon parameter is used to tune SSA in order to achieve the optimal trade-off between accuracy and storage space.

Space Compression (SC):

$\frac{\# \mspace{14mu} {of}\mspace{14mu} {tuples}\mspace{14mu} {created}\mspace{14mu} {by}\mspace{14mu} {naïve}\mspace{14mu} {algorithm}}{\# \mspace{14mu} {of}\mspace{14mu} {tuples}\mspace{14mu} {created}\mspace{14mu} {by}\mspace{14mu} {SSA}}$

False Negative Rate (FNR):

$\frac{\mspace{11mu} \begin{matrix}{\# \mspace{14mu} {of}\mspace{14mu} \alpha \text{-}{persistent}\mspace{14mu} {objects}\mspace{14mu} {reported}\mspace{14mu} {by}} \\{{naïve}\mspace{14mu} {algorithm}\mspace{14mu} {that}\mspace{14mu} {were}\mspace{14mu} {not}\mspace{14mu} {reported}\mspace{14mu} {by}\mspace{14mu} {SSA}}\end{matrix}\;}{\# \mspace{14mu} {of}\mspace{14mu} \alpha \text{-}{persistent}\mspace{14mu} {objects}\mspace{14mu} {reported}\mspace{14mu} {by}\mspace{14mu} {naïve}\mspace{14mu} {algorithm}}$

The ε value (epsilon) controls the trade-off between these two quantities:

-   -   High ε yields high compression, but also high FNR
    -   Low ε yields low FNR, but also low compression

Caveats: Sensitive to the distribution of α-persistent items in the stream

α-persistent: α is the percentage of monitored timeslots in which an object occurs.

FIGS. 5 and 6 respectively graphically illustrate space compression (SC) and false negative rate (FNR) versus epsilon. As illustrated in FIGS. 5 and 6, these graphs show that SC and FNR increase roughly linearly with respect to epsilon. The graphs assume α=0.2 with insider threat events occurring in 18-minute intervals.
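The two metrics defined above can be made concrete with a small sketch. The code below is not SSA: it pairs the naïve exact count with a deliberately simplified subsampling tracker, written only so that SC and FNR have something to be computed against. The `sample_rate` parameter is an assumption standing in for the role of ε (a lower rate behaves like a higher ε), "tuples" is taken to mean one tracked entry per distinct item, and the stream contents are hypothetical.

```python
import hashlib
from collections import defaultdict


def _sampled(item, slot, sample_rate):
    """Deterministic pseudo-random coin flip for an (item, slot) pair."""
    digest = hashlib.sha256(f"{item}|{slot}".encode()).digest()
    return int.from_bytes(digest[:8], "big") < sample_rate * 2**64


def naive_persistent(stream, n_slots, alpha):
    """Track every item exactly; stream is an iterable of (slot, item)."""
    slots = defaultdict(set)
    for slot, item in stream:
        slots[item].add(slot)
    reported = {i for i, s in slots.items() if len(s) / n_slots >= alpha}
    return reported, len(slots)


def sampled_persistent(stream, n_slots, alpha, sample_rate):
    """Simplified stand-in: start tracking an item only once it is sampled."""
    slots = {}
    for slot, item in stream:
        if item in slots:
            slots[item].add(slot)
        elif _sampled(item, slot, sample_rate):
            slots[item] = {slot}
    reported = {i for i, s in slots.items() if len(s) / n_slots >= alpha}
    return reported, len(slots)


def space_compression(naive_tuples, sampled_tuples):
    return float("inf") if sampled_tuples == 0 else naive_tuples / sampled_tuples


def false_negative_rate(naive_reported, sampled_reported):
    if not naive_reported:
        return 0.0
    return len(naive_reported - sampled_reported) / len(naive_reported)


# Hypothetical stream over 10 timeslots with alpha = 0.2.
stream = sorted([(s, "login") for s in range(10)]
                + [(s, "exfil_query") for s in range(0, 10, 2)]
                + [(3, "rare")])
naive_rep, naive_tuples = naive_persistent(stream, 10, 0.2)
full_rep, full_tuples = sampled_persistent(stream, 10, 0.2, sample_rate=1.0)
sc = space_compression(naive_tuples, full_tuples)
fnr = false_negative_rate(naive_rep, full_rep)
```

With `sample_rate=1.0` the tracker degenerates to the naïve algorithm, so SC is 1 and FNR is 0. Lower rates can only drop or undercount items, which is the direction of the trade-off described above: fewer stored tuples in exchange for missed persistent items.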

In certain implementations that utilize heterogeneous data streams, correlation of events received from a plurality of various heterogeneous data streams often requires certain processing, modification and manipulation of the raw stream data. This processing, etc., enables embodiments of the system and method for real-time detection of anomalies in database or application usage to correlate heterogeneous data streams, detect events, correlate events across data streams and otherwise perform real-time detection.

An embodiment of a system and method for processing of heterogeneous data streams may be referred to as an automatic event tabulator and correlator (AETAC). Embodiments of AETAC include an algorithm that automatically tabulates and correlates event data collected by sensors and other automated data collection devices. Embodiments of AETAC can process events of all types uniformly, even if the data definitions for each device are different (i.e., “heterogeneous”). Embodiments of AETAC operate by imposing a mathematical structure (“homomorphism”) on each event type that makes all events look the same to the tabulating device and then further imposing a requirement that this structure be preserved through successive processing steps (“closure”), such that all outputs from the tabulator have the same mathematical structure as the input events. This uniformity reduces data complexity and facilitates searches, retrievals and correlation of event data at any stage of processing, such that complex event processing (CEP) may be reduced to abstractions equivalent to evaluating simple mathematical expressions.

A purpose of AETAC is to improve the performance of complex event processing (CEP) systems that monitor large networks of sensors or other kinds of data collection devices. Embodiments of AETAC reduce or eliminate the need to write customized code for each device that collects data. Embodiments achieve this simplification by imposing several mild restrictions on the allowable formats of the data, which do not impede the functioning of the collection devices. The resulting uniformity makes it easier to compare events across space and time and, consequently, increases the overall “situational awareness” of networks organized under AETAC principles.

AETAC reduces complexity by imposing a simple structure on events that requires the data to conform to these mild restrictions:

1) The events recognized by the sensor can be described as lists of names and values;

2) Each generated list is discrete and unique, in the sense that it bears a unique identifier and timestamp (this is easily accomplished by using one-up sequences or hash codes);

3) The lists can be decomposed into deterministic types, such that the composition of each list is fixed for each type. Note that this requirement is automatically fulfilled for most sensors, which use standardized packet protocols for defining data fields.

These restrictions establish an algebraic structure (homomorphic mapping with closure property) such that ID, TIMESTAMP, TYPE can serve as an intrinsic structure for any event. Additional fields generated are aggregated in such a manner that they form a DETAIL object, indexed by the ID. These DETAIL objects do not have to be uniquely defined, but are guaranteed to be a surjective mapping (one to many) or injective mapping (one to one) because the IDs are uniquely defined. DETAIL fields may be optional (equivalent to endomorphic mapping) if the ID, TIMESTAMP, TYPE fields suffice to define an event completely.

The closure property simply means that all transformations of events or results obtained through processing must also conform to the above restrictions. This guarantees that all operators used on the input events can also be applied recursively to outputs of transformations and other processing results.

For example, a classification of an event would generate a “classification event” → (ID, TIMESTAMP, CLASSIFICATION EVENT) with a DETAIL record containing the results of the classification.
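The event structure and closure property described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `Event`, `DETAILS`, `make_event` and `classify`, the in-memory dictionary used to hold DETAIL objects, and the sample field values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

_next_id = count(1)   # "one-up" sequence supplying unique IDs
DETAILS = {}          # DETAIL objects, indexed by event ID (hypothetical store)

@dataclass(frozen=True)
class Event:
    """Intrinsic structure shared by every event: (ID, TIMESTAMP, TYPE)."""
    id: int
    timestamp: datetime
    type: str

def make_event(timestamp, event_type, detail=None):
    """Create an event; any additional fields go into an optional DETAIL object."""
    event = Event(next(_next_id), timestamp, event_type)
    if detail is not None:
        DETAILS[event.id] = detail
    return event

def classify(event, label, timestamp):
    """Closure: the result of processing an event is itself an event."""
    return make_event(timestamp, "CLASSIFICATION_EVENT",
                      {"source_id": event.id, "label": label})

login = make_event(datetime(2015, 6, 5, 22, 30, tzinfo=timezone.utc),
                   "DB_LOGIN", {"user": "alice"})
result = classify(login, "after_hours",
                  datetime(2015, 6, 5, 22, 30, 1, tzinfo=timezone.utc))
# Because outputs share the input structure, they can be processed again.
reviewed = classify(result, "reviewed",
                    datetime(2015, 6, 5, 22, 31, tzinfo=timezone.utc))
```

Because `classify` returns an `Event`, the same operator can be applied to its own output, which is the recursive behavior the closure property is meant to guarantee.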

Once events in a data stream have been defined, conforming to the above restrictions, it is then possible to process events arriving through multiple, heterogeneous channels using a basic AETAC algorithm. (Note: in embodiments, events must be ordered by timestamp such that each arrival channel is an ordered time-series):

-   Step 1: Select the earliest event from the available channels using a harmonized timestamp field;
-   Step 2: Assign a unique ID (to be used as the unique key to the DETAIL object, if any; see step 4);
-   Step 3: Assign a TYPE to the event tuple, ensuring that the event tuple is fixed with respect to TYPE;
-   Step 4: Aggregate any remaining fields in a DETAIL object, indexed by ID (step 2); and
-   Step 5: Repeat steps 1 through 4 until all events are defined.
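The five steps above can be sketched in a few lines. This is only an illustrative sketch: the record layout (dictionaries with hypothetical `ts` and `source` fields), the function name `aetac`, and the choice of the source name as the TYPE are assumptions, and the standard-library `heapq.merge` stands in for whatever channel-selection mechanism an embodiment would actually use.

```python
import heapq
from itertools import count

def aetac(channels):
    """Tabulate raw records from time-ordered channels into uniform events.

    channels: list of lists of dicts, each list already sorted by "ts".
    Returns (events, details): events are (ID, TIMESTAMP, TYPE) tuples and
    details maps an event ID to its DETAIL object (remaining fields).
    """
    ids = count(1)
    events, details = [], {}
    # Step 1: heapq.merge always yields the earliest pending record
    # across all channels, using the harmonized timestamp field.
    for record in heapq.merge(*channels, key=lambda r: r["ts"]):
        event_id = next(ids)             # Step 2: assign a unique ID
        event_type = record["source"]    # Step 3: assign a TYPE (assumed rule)
        events.append((event_id, record["ts"], event_type))
        # Step 4: aggregate the remaining fields in a DETAIL object.
        rest = {k: v for k, v in record.items() if k not in ("ts", "source")}
        if rest:
            details[event_id] = rest
    # Step 5: the loop repeats until every channel is exhausted.
    return events, details

db_channel = [{"ts": 1, "source": "db", "query": "SELECT"},
              {"ts": 5, "source": "db", "query": "UPDATE"}]
badge_channel = [{"ts": 3, "source": "badge", "door": "lobby"}]
events, details = aetac([db_channel, badge_channel])
```

Here the database and badge-reader records have different fields, yet both end up as the same three-field tuples, with the device-specific fields set aside in `details`.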

Complex event processing (CEP) generally processes many kinds of events from a variety of sensors and collection devices. Each device poses an integration problem that involves writing software to accept, validate and interpret the input data and to determine what actions need to be taken for each sensor.

Embodiments of AETAC simplify this process by providing a shared mathematical structure that standardizes the creation of event handlers and optimizes the reuse of code, allowing many different kinds of devices to use exactly the same code to process their data.

Furthermore, due to the mathematical closure property of embodiments of AETAC, the results and outputs of processing will also share this same structure and so can allow each layer of the system to automatically feed to the next layer recursively.

An innovative concept of embodiments of AETAC is the application of algebraic structure on the inputs and outputs of event processing, which allows AETAC to be an automatic ‘tabulator’ device for events, which can be easily correlated with other events because of the shared structure and closure properties.

Existing systems tend to become specialized in the symbols and formats used to characterize each system. As such, existing systems are dependent on specialized features, which may not be uniformly or consistently applied.

An analogous example of this kind of specialization is Roman Numerals, where for historical reasons, natural numbers can be represented and manipulated mathematically as sequential combinations of these seven symbols: I, V, X, L, C, D, M. The laws for combining these symbols are not applied consistently. For example, the immediate successor to any number is usually obtained by concatenating the symbol “I” to the number. So the successor of “I” is “II,” and the successor of “V” is “VI.” The successors for “III” and “VIII,” however, are computed by prefixing “I” to the next larger symbol, namely “IV” and “IX,” respectively. Roman Numerals are not complete. There are no Roman symbols for representing zero or negative numbers. Consequently, there is no simple or automatic way to tabulate a collection of Roman Numerals into another collection of Roman Numerals without a lot of complicated rules and processing.

Modern Arabic numerals, on the other hand, are written using combinations of the ten symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The successor to any number is simply and consistently determined by addition tables which are applied to any range of numbers using simple rules of arithmetic. Further, the Arabic numeral system is complete in the sense that any real number can be expressed, including zero and negative numbers. All outputs of calculations can be used as inputs to succeeding calculations in a completely ‘mechanical’ fashion.

In this same sense, embodiments of AETAC impose an algebraic structure on event processing that makes the processing independent of the type of event being processed, and ensures consistent and uniform processing of events, where outputs are new kinds of events requiring further processing.

With reference now to FIGS. 3A-3B, shown are flowcharts illustrating an embodiment of a method 300 for real-time detection of anomalies in database or application usage. FIG. 3A illustrates the portion of method 300 that builds the model of normalcy and rules as described above. FIG. 3B illustrates the portion of method 300 that applies the model of normalcy and rules to detect anomalies. Method 300 may be repeated to continuously update the model of normalcy and rules through, e.g., machine learning techniques, as described above. Likewise, method 300 may be performed continuously as data streams are received to continuously detect anomalies in real-time. Method 300 may implement SSA as described to build the model of normalcy and rules and to detect anomalies.

With continuing reference to FIG. 3A, method 300 receives a plurality of data streams, block 302. The received data streams may be heterogeneous data streams received from a plurality of agents, programs, and/or sensors, etc. The data streams are correlated, block 304. The correlation 304 may include processing, e.g., with an embodiment of AETAC. The correlation 304 identifies events in the various data streams. Method 300 identifies patterns of events across the various data streams, block 306. The patterns may provide indications of relations between events in different data streams under typical operating conditions. Method 300 builds/creates a model or pattern of normalcy from the identified patterns of events, block 308. Utilizing the model of normalcy, method 300 may build/create rules, block 310, that determine how and whether anomalies are detected, how method 300 treats, characterizes and reacts to a detected anomaly, etc. For example, an event that may be characterized as an anomaly when occurring off-hours may not be an anomaly, or may not be an anomaly worth issuing an alert for, if occurring during normal business hours. Method 300 may repeat 302-310, block 312, over time using machine learning techniques to continue to build and update 308 the model of normalcy and build and update 310 the rules.

With reference now to FIG. 3B, method 300 may apply the model of normalcy and rules to operational behavior to detect anomalies. Method 300 receives a plurality of data streams, block 314. The received data streams, typically heterogeneous and received from a plurality of agents, programs, and/or sensors, etc., are processed (e.g., using an embodiment of AETAC) and the data from the data streams is analyzed against the model of normalcy and the rules, block 316. Based on the rules and how events, in relation to other events, fit or do not fit the model of normalcy, method 300 determines whether events are anomalous, thereby detecting anomalies, block 318. If an anomaly is detected 318, method 300 may determine the characteristics of the anomalous event(s), block 320. For example, an anomaly may be a user accessing a secured server after hours. If the user does access a secured server from time to time after hours, such an anomaly may not trigger an alert. If, however, the user is accessing the server from his office after hours but the user did not “badge in” (i.e., the user's employee badge was not read by security at the entrance to the building), then the anomaly would trigger an alert. Consequently, the rules and the characteristics of an anomaly may determine whether to issue an alert, block 322. For example, an anomaly indicating improper access to a server may only trigger continued monitoring of the user. The characteristics of an anomaly may also determine what type of alert to issue. Some anomalies may require a simple flag in a file for further follow-up by a human agent monitoring anomalies. Other anomalies may require immediate action, such as restricting or shutting off access to a network, locking down a building or portion of a building, alerting security, etc. If method 300 determines to issue an alert and the type of alert is determined, the alert is issued, block 324. Method 300 may continue to repeat 314 to 324, block 326, so long as systems, etc., are being monitored.
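The after-hours example above suggests how the characteristics of an anomaly might drive the alert decision of blocks 320-324. The sketch below is a hypothetical rule set written only to mirror that example: the `Anomaly` fields, the `decide` function, the returned labels, and the fall-through choice of flagging for human review are all assumptions, not rules taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    """Characteristics of a detected anomaly (hypothetical fields)."""
    user: str
    after_hours: bool
    from_office: bool
    badged_in: bool
    usual_for_user: bool  # user accesses the server after hours from time to time

def decide(anomaly):
    """Map anomaly characteristics to a response: alert, no alert, or a flag."""
    # Office access after hours without a badge read is inconsistent across
    # data streams, so it triggers an alert regardless of the user's habits.
    if anomaly.after_hours and anomaly.from_office and not anomaly.badged_in:
        return "alert_security"
    # After-hours access that fits this user's normal pattern needs no alert.
    if anomaly.after_hours and anomaly.usual_for_user:
        return "no_alert"
    # Assumed default: leave a flag for follow-up by a human agent.
    return "flag_for_review"

habitual = Anomaly("alice", after_hours=True, from_office=False,
                   badged_in=False, usual_for_user=True)
no_badge = Anomaly("bob", after_hours=True, from_office=True,
                   badged_in=False, usual_for_user=True)
unusual = Anomaly("carol", after_hours=True, from_office=True,
                  badged_in=True, usual_for_user=False)
decisions = [decide(a) for a in (habitual, no_badge, unusual)]
```

The ordering of the checks matters in this sketch: the cross-stream inconsistency (office access without a badge read) is tested first so that a user's habitual after-hours access cannot suppress it.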

With reference now to FIG. 4, shown is a block diagram of exemplary hardware that may be used to provide system 100 and perform method 300 for real-time detection of anomalies in database or application usage. An exemplary hardware implementation of system 100 may include multiple computing devices 400 (e.g., computing system N). Computing devices 400 may be, e.g., blade servers or other stack servers. For example, each component shown in system 100 may be implemented as software running on one or more computing devices 400. Alternatively, the components and the functionality of each may be combined and implemented as software running on a single computing device 400. Furthermore, steps of method 300 may be implemented as software modules executed on one or more computing devices 400.

Computing device 400 may include a memory 402, a secondary storage device 404, a processor 406, and a network connection 408. Computing device 400 may be connected to a display device 410 (e.g., a terminal connected to multiple computing devices 400) and an output device 412. Memory 402 may include RAM or similar types of memory, and it may store one or more applications (e.g., software for performing functions or including software modules described herein) for execution by processor 406. Secondary storage device 404 may include a hard disk drive, DVD-ROM drive, or other types of non-volatile data storage. Processor 406 executes the applications, which are stored in memory 402 or secondary storage 404, or received from the Internet or other network 414. Network connection 408 may include any device connecting computing device 400 to a network 414 and through which information is received and through which information (e.g., analysis results) is transmitted to other computing devices. Network connection 408 may include a network connection providing connection to an internal enterprise network, a network connection providing connection to the Internet, or other similar connection. Network connection 408 may also include bus connections providing connections to other computing devices 400 in system 100 (e.g., other servers in a server stack).

Display device 410 may include any type of device for presenting visual information such as, for example, a computer monitor or flat-screen display. Output device 412 may include any type of device for presenting a hard copy of information, such as a printer, and other types of output devices include speakers or any device for providing information in audio form. Computing device 400 may also include an input device, such as a keyboard or mouse, permitting direct input into computing device 400.

Computing device 400 may store a database structure in secondary storage 404, for example, for storing and maintaining information needed or used by the software stored on computing device 400. Also, processor 406 may execute one or more software applications in order to provide the functions described in this specification, specifically in the methods described above, and the processing may be implemented in software, such as software modules, for execution by computers or other machines. The processing may provide and support web pages and other user interfaces.

Although computing device 400 is depicted with various components, one skilled in the art will appreciate that the servers can contain additional or different components. In addition, although aspects of an implementation consistent with the above are described as being stored in memory, one skilled in the art will appreciate that these aspects can also be stored on or read from other types of computer program products or computer-readable media. The computer-readable media may include instructions for controlling a computer system, such as computing device 400, to perform a particular method, such as method 300.

More generally, even though the present disclosure and exemplary embodiments are described above with reference to the examples according to the accompanying drawings, it is to be understood that they are not restricted thereto. Rather, it is apparent to those skilled in the art that the disclosed embodiments can be modified in many ways without departing from the scope of the disclosure herein. Moreover, the terms and descriptions used herein are set forth by way of illustration only and are not meant as limitations. Those skilled in the art will recognize that many variations are possible within the spirit and scope of the disclosure as defined in the following claims, and their equivalents, in which all terms are to be understood in their broadest possible sense unless otherwise indicated.

What is claimed is:
 1. A method for real-time detection of anomalies occurring in an enterprise computer network, comprising: receiving a plurality of heterogeneous data streams from sources in the network, the sources including two levels, first level sources and second level sources, wherein the first level sources include one or more selected from a group consisting of agents located at databases, agents located at applications, audit programs located at user workstations, sensors located in the network, and sensors located at access points to the network, wherein the second level sources include one or more selected from a group consisting of data access, user behavior, computer activity and network activity, and wherein the first level sources monitor event streams of the second level sources and generate data streams indicative of corresponding second level source activity in a uniform format; processing the heterogeneous data streams obtained by combining at least two of the first level sources to identify events therein, each event being identified by at least a unique ID, a timestamp, and an event type, wherein the processing of the heterogeneous data streams includes combining at least two of the first level sources into a single data stream; correlating the processed heterogeneous data streams to form an integrated data stream comprising a plurality of identified events; detecting the existence and at least one characteristic of an anomaly in the computer network by application of a predetermined model of normalcy and one or more anomaly rules to the integrated data stream comprising the plurality of identified events; and issuing an alert based on the at least one characteristic of the anomaly.
 2. The method of claim 1 further comprising creating the predetermined model of normalcy and the one or more anomaly rules by: receiving additional data comprising the plurality of heterogeneous data streams, wherein the additional data corresponds to authorized and benign usage of network resources; processing the heterogeneous data streams to identify events therein, each event being identified by at least a unique ID, a timestamp, and an event type; correlating the processed data streams to form an integrated data stream comprising a plurality of identified events; identifying one or more patterns from relations between identified events comprising the integrated data stream; and creating the model of normalcy and the one or more anomaly rules based on the identified one or more patterns.
 3. The method of claim 1 wherein the one or more anomaly rules relate to at least one of how and whether anomalies are detected, how a detected anomaly is treated and characterized, and what reaction to employ in response to the detected anomaly.
 4. The method of claim 1 further comprising: estimating a number or frequency of one or more event types in the processed data stream without searching the entire processed data stream; and determining one or more temporal, spatial, or generalized associations between a plurality of events in the processed data stream.
 5. The method of claim 1 wherein the detected anomaly is indicative of unauthorized manipulation or falsification of data, sabotage of a database, or exfiltration of data.
 6. The method of claim 1 wherein the heterogeneous data streams comprise multi-modal asynchronous signals.
 7. The method of claim 1 wherein the program code includes an algorithm that detects and extracts persistent events among the plurality of identified events in at least one of the plurality of heterogeneous data streams, and wherein the persistent events are time-stamped data that appear regularly over time.
 8. The method of claim 7 wherein the persistent events appear in different distributed streams among the plurality of heterogeneous data streams.
 9. The method of claim 7 wherein the at least one of the plurality of heterogeneous data streams is statistically sampled to reduce stream size of the at least one of the plurality of heterogeneous data streams, without overlooking the persistent events.
 10. A method for real-time detection of anomalies occurring in an enterprise computer network, comprising: receiving a plurality of heterogeneous data streams from sources in the network, the sources including two levels, first level sources and second level sources, wherein the first level sources include one or more selected from a group consisting of agents located at databases; agents located at applications; audit programs located at user workstations; sensors located in the network; and sensors located at access points to the network, wherein the second level sources include one or more selected from a group consisting of data access, user behavior, computer activity and network activity, and wherein the first level sources monitor event streams of the second level sources and generate data streams indicative of corresponding second level source activity in a uniform format; processing the heterogeneous data streams obtained by combining at least two of the first level sources to identify events therein, each event being identified by at least a unique ID, a timestamp, and an event type, wherein the processing of the heterogeneous data streams includes: combining at least two of the first level sources into a single data stream; and operating on the single data stream using an algorithm that identifies spatiotemporal relationships; correlating the processed heterogeneous data streams to form an integrated data stream comprising a plurality of identified events; detecting the existence and at least one characteristic of an anomaly in the computer network by application of a predetermined model of normalcy and one or more anomaly rules to the integrated data stream comprising the plurality of identified events; and issuing an alert based on the at least one characteristic of the anomaly.
 11. The method of claim 10 further comprising creating the predetermined model of normalcy and the one or more anomaly rules by: receiving additional data comprising the plurality of heterogeneous data streams, wherein the additional data corresponds to authorized and benign usage of network resources; processing the heterogeneous data streams to identify events therein, each event being identified by at least a unique ID, a timestamp, and an event type; correlating the processed data streams to form an integrated data stream comprising a plurality of identified events; identifying one or more patterns from relations between identified events comprising the integrated data stream; and creating the model of normalcy and the one or more anomaly rules based on the identified one or more patterns.
 12. The method of claim 10 wherein the one or more anomaly rules relate to at least one of how and whether anomalies are detected, how a detected anomaly is treated and characterized, and what reaction to employ in response to the detected anomaly.
 13. The method of claim 10 further comprising: estimating a number or frequency of one or more event types in the processed data stream without searching the entire processed data stream; and determining one or more temporal, spatial, or generalized associations between a plurality of events in the processed data stream.
 14. The method of claim 10 wherein the detected anomaly is indicative of unauthorized manipulation or falsification of data, sabotage of a database, or exfiltration of data.
 15. The method of claim 10 wherein the heterogeneous data streams comprise multi-modal asynchronous signals.
 16. The method of claim 10 wherein the program code includes an algorithm that detects and extracts persistent events among the plurality of identified events in at least one of the plurality of heterogeneous data streams, and wherein the persistent events are time-stamped data that appear regularly over time.
 17. The method of claim 16 wherein the persistent events appear in different distributed streams among the plurality of heterogeneous data streams.
 18. The method of claim 16 wherein the at least one of the plurality of heterogeneous data streams is statistically sampled to reduce stream size of the at least one of the plurality of heterogeneous data streams, without overlooking the persistent events.
 19. A method for real-time detection of anomalies occurring in a computer network, comprising: receiving a plurality of heterogeneous data streams from sources in the network, the sources including first level sources and second level sources, wherein the first level sources include one or more selected from a group consisting of agents located at databases, agents located at applications, audit programs located at user workstations, sensors located in the network, and sensors located at access points to the network; wherein the second level sources include event streams to be analyzed, wherein the first level sources monitor the event streams of the second level sources and generate data streams indicative of corresponding second level source activity in a uniform format, and wherein each of the heterogeneous data streams is obtained by combining at least two of the first level sources into a data stream; processing the heterogeneous data streams to identify events therein, each event being identified by at least a unique ID, a timestamp, and an event type; correlating the processed heterogeneous data streams to form an integrated data stream comprising a plurality of identified events; detecting the existence and at least one characteristic of an anomaly in the computer network by application of a predetermined model of normalcy and one or more anomaly rules to the integrated data stream comprising the plurality of identified events; and issuing an alert based on the at least one characteristic of the anomaly.
 20. The method of claim 19 wherein the second level sources include one or more selected from the group consisting of data access, user behavior, computer activity and network activity.